Comparative Analysis of TMP Databases¶

Project: Digital Teotihuacan Mapping Project (TMP) - Phase 1

Objective: This notebook synthesizes the results from the entire profiling pipeline to conduct a comparative analysis of the four legacy databases and the two wide-format benchmark databases. Its primary goal is to use quantitative data to compare these database architectures on three key axes: Structural Complexity, Resource Usage, and Query Performance.

The findings from this notebook will directly inform the final recommendation for the Phase 2 unified database architecture.


1. Setup and Configuration¶

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from IPython.display import display, Markdown
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import MinMaxScaler
import os

# --- Path Definitions ---
# Use more robust path construction based on notebook file location
try:
    # Try to get notebook's directory (works in most Jupyter environments)
    NOTEBOOK_DIR = Path(os.getcwd())
    # Find the project root by looking for specific markers
    current_path = NOTEBOOK_DIR
    while current_path != current_path.parent:
        if (current_path / "TASKS.md").exists() or (current_path / "pyproject.toml").exists():
            PROJECT_ROOT = current_path
            break
        current_path = current_path.parent
    else:
        # Fallback: assume standard structure
        PROJECT_ROOT = NOTEBOOK_DIR.parent.parent.parent.parent
    
    REPORTS_DIR = PROJECT_ROOT / "phases" / "01_LegacyDB" / "outputs" / "reports"
except Exception as e:
    print(f"Warning: Path detection failed ({e}). Using fallback path construction.")
    # Fallback to original approach
    PROJECT_ROOT = Path.cwd().parent.parent
    REPORTS_DIR = PROJECT_ROOT / "outputs" / "reports"

# --- Styling and Display Options ---
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

def display_header(title):
    display(Markdown(f'### {title}'))

print("✅ Setup complete.")
print(f"Project Root: {PROJECT_ROOT}")
print(f"Reports Directory: {REPORTS_DIR}")
print(f"Reports Directory exists: {REPORTS_DIR.exists()}")
✅ Setup complete.
Project Root: C:\Users\rcesa\ASU Dropbox\Rudolf Cesaretti\GitHubRepos\TeoMappingProject
Reports Directory: C:\Users\rcesa\ASU Dropbox\Rudolf Cesaretti\GitHubRepos\TeoMappingProject\phases\01_LegacyDB\outputs\reports
Reports Directory exists: True

2. Data Loading¶

We load two key outputs from the 04_run_comparison.py script:

  1. comparison_matrix.csv: High-level summary metrics.
  2. report_performance_summary_detailed.csv: The enriched, long-form performance data with calculated comparative metrics.
In [2]:
matrix_path = REPORTS_DIR / 'comparison_matrix.csv'
perf_path = REPORTS_DIR / 'report_performance_summary_detailed.csv'

# Validate file existence with detailed error reporting
missing_files = []
if not matrix_path.exists():
    missing_files.append(str(matrix_path))
if not perf_path.exists():
    missing_files.append(str(perf_path))

if missing_files:
    print(f"Directory contents: {list(REPORTS_DIR.glob('*')) if REPORTS_DIR.exists() else 'Directory does not exist'}")
    raise FileNotFoundError(f"Critical Error: The following report files were not found:\n" + 
                           "\n".join(f"  - {f}" for f in missing_files) + 
                           f"\n\nSearched in: {REPORTS_DIR}\n" +
                           "Please run the 04_run_comparison.py script first to generate these files.")

try:
    # Load the matrix and transpose it so databases are rows
    comparison_df = pd.read_csv(matrix_path, index_col=0).T.reset_index().rename(columns={'index': 'Database'})
    print(f"✅ Successfully loaded comparison matrix: {comparison_df.shape[0]} databases, {comparison_df.shape[1] - 1} metrics")
    
    # Load the detailed performance data
    perf_summary_df = pd.read_csv(perf_path)
    print(f"✅ Successfully loaded performance data: {perf_summary_df.shape[0]} records")
    
except Exception as e:
    print(f"Error loading data files: {e}")
    raise

print("\nLoaded Comparison Matrix:")
display(comparison_df)

print("\nLoaded Detailed Performance Summary Data:")
display(perf_summary_df.head())
print(f"\nPerformance data columns: {list(perf_summary_df.columns)}")
print(f"Unique databases in performance data: {perf_summary_df['database'].unique() if 'database' in perf_summary_df.columns else 'No database column found'}")
✅ Successfully loaded comparison matrix: 6 databases, 8 metrics
✅ Successfully loaded performance data: 18 records

Loaded Comparison Matrix:
| | Database | Database Size (MB) | Table Count | View Count | Total Estimated Rows | Total Index Count | JDI (Join Dependency Index) | LIF (Logical Interop. Factor) | NF (Normalization Factor) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TMP_DF10 | 64.00 | 9.00 | 0.00 | 485797.00 | 29.00 | 0.28 | 7.00 | 0.25 |
| 1 | TMP_DF8 | 20.00 | 27.00 | 0.00 | 136350.00 | 27.00 | 0.07 | 2.00 | 0.21 |
| 2 | TMP_DF9 | 28.00 | 62.00 | 0.00 | 106109.00 | 196.00 | 0.07 | 9.00 | 0.35 |
| 3 | TMP_REAN_DF2 | 14.00 | 13.00 | 0.00 | 65715.00 | 13.00 | 0.15 | 1.00 | 0.19 |
| 4 | tmp_benchmark_wide_numeric | 21.00 | 1.00 | 0.00 | 5050.00 | 0.00 | NaN | NaN | NaN |
| 5 | tmp_benchmark_wide_text_nulls | 62.00 | 1.00 | 0.00 | 5050.00 | 19.00 | NaN | NaN | NaN |
Loaded Detailed Performance Summary Data:
query_name status latency_ms records_returned error_message executed_sql database is_benchmark category query_id baseline_latency_ms schema_efficiency_factor performance_improvement_factor
0 Baseline Performance - Query 1.1 Success 1.48 1 NaN SELECT COUNT(*) FROM public.wide_format_data; tmp_benchmark_wide_numeric True Baseline Performance Query 1.1 1.48 1.00 0.00
1 Join_performance Performance - Query 2.1 Success 12.03 4595 NaN SELECT\r\n "site",\r\n "subsite",\r\n ... tmp_benchmark_wide_numeric True Join_performance Performance Query 2.1 12.03 1.00 0.00
2 Complex_filtering Performance - Query 3.1 Success 2.50 1 NaN SELECT SUM("obsidianBlades") AS total_obsidian... tmp_benchmark_wide_numeric True Complex_filtering Performance Query 3.1 1.94 1.29 0.00
3 Baseline Performance - Query 1.1 Success 2.03 1 NaN SELECT COUNT(*) FROM public.wide_format_data; tmp_benchmark_wide_text_nulls True Baseline Performance Query 1.1 1.48 1.37 0.00
4 Join_performance Performance - Query 2.1 Success 16.12 4595 NaN SELECT\r\n "site",\r\n "subsite",\r\n ... tmp_benchmark_wide_text_nulls True Join_performance Performance Query 2.1 12.03 1.34 0.00
Performance data columns: ['query_name', 'status', 'latency_ms', 'records_returned', 'error_message', 'executed_sql', 'database', 'is_benchmark', 'category', 'query_id', 'baseline_latency_ms', 'schema_efficiency_factor', 'performance_improvement_factor']
Unique databases in performance data: ['tmp_benchmark_wide_numeric' 'tmp_benchmark_wide_text_nulls' 'TMP_DF10'
 'TMP_DF8' 'TMP_DF9' 'TMP_REAN_DF2']

3. High-Level Comparison Matrix¶

A styled view of the main comparison matrix. Color gradients highlight high/low values for each metric, providing an at-a-glance summary.

  • Purple/Dark: Higher values
  • Yellow/Light: Lower values
In [3]:
display_header("Styled Comparison Matrix")

styled_df = comparison_df.style.background_gradient(cmap='viridis', axis=0)\
    .set_caption("Comparative Database Metrics")\
    .format('{:.2f}', subset=pd.IndexSlice[:, ['Database Size (MB)', 'JDI (Join Dependency Index)', 'NF (Normalization Factor)']])\
    .format('{:,.0f}', subset=pd.IndexSlice[:, ['Table Count', 'View Count', 'Total Estimated Rows', 'Total Index Count', 'LIF (Logical Interop. Factor)']])

display(styled_df)

Styled Comparison Matrix¶

Comparative Database Metrics
| | Database | Database Size (MB) | Table Count | View Count | Total Estimated Rows | Total Index Count | JDI (Join Dependency Index) | LIF (Logical Interop. Factor) | NF (Normalization Factor) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TMP_DF10 | 64.00 | 9 | 0 | 485,797 | 29 | 0.28 | 7 | 0.25 |
| 1 | TMP_DF8 | 20.00 | 27 | 0 | 136,350 | 27 | 0.07 | 2 | 0.21 |
| 2 | TMP_DF9 | 28.00 | 62 | 0 | 106,109 | 196 | 0.07 | 9 | 0.35 |
| 3 | TMP_REAN_DF2 | 14.00 | 13 | 0 | 65,715 | 13 | 0.15 | 1 | 0.19 |
| 4 | tmp_benchmark_wide_numeric | 21.00 | 1 | 0 | 5,050 | 0 | nan | nan | nan |
| 5 | tmp_benchmark_wide_text_nulls | 62.00 | 1 | 0 | 5,050 | 19 | nan | nan | nan |

4. Structural Complexity Analysis¶

This section focuses on the metrics that quantify the relational complexity and degree of normalization of the legacy schemas.
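The authoritative JDI/LIF/NF formulas are defined in the upstream profiling scripts; as a minimal illustration of the idea only, the sketch below assumes a JDI-style score is the ratio of foreign-key links to possible directed table pairs, and uses the share of FK-referenced tables as a stand-in for a normalization score. Both definitions are assumptions for the example, not the project's actual formulas.

```python
# Illustrative sketch only: the real JDI/NF formulas live in the
# upstream profiling scripts. These simplified definitions just show
# how schema-graph ratios of this kind can be computed.

def complexity_metrics(tables):
    """tables maps table name -> list of tables it references via FK."""
    n = len(tables)
    fk_links = sum(len(refs) for refs in tables.values())
    max_links = n * (n - 1)  # possible directed table pairs
    jdi = fk_links / max_links if max_links else 0.0  # join-dependency ratio
    referenced = {t for refs in tables.values() for t in refs}
    nf = len(referenced) / n if n else 0.0  # share of FK-referenced tables
    return {"table_count": n, "jdi": round(jdi, 2), "nf": round(nf, 2)}

# Toy schema: three tables linked by two foreign keys
print(complexity_metrics({"ssn": ["site"], "site": ["region"], "region": []}))
# -> {'table_count': 3, 'jdi': 0.33, 'nf': 0.67}
```

Under these toy definitions, a densely linked schema like TMP_DF10 would score a higher join-dependency ratio than a schema with many loosely coupled tables like TMP_DF8, which matches the pattern in the matrix above.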

In [4]:
display_header("Schema Complexity Metrics (Legacy Databases)")

complexity_metrics = [
    'Database', 'Table Count', 
    'JDI (Join Dependency Index)', 
    'LIF (Logical Interop. Factor)',
    'NF (Normalization Factor)'
]
# Filter for legacy DBs only, as these metrics don't apply to the single-table benchmarks
legacy_df = comparison_df[~comparison_df['Database'].str.contains('benchmark')]
display(legacy_df[complexity_metrics])

# --- Advanced Visualization: Complexity Radar Plot ---
radar_metrics = ['Table Count', 'JDI (Join Dependency Index)', 'NF (Normalization Factor)']
radar_df = legacy_df[['Database'] + radar_metrics].copy()

# Normalize metrics to a 0-1 scale for fair comparison on the radar plot
scaler = MinMaxScaler()
radar_df[radar_metrics] = scaler.fit_transform(radar_df[radar_metrics])

fig = go.Figure()

for index, row in radar_df.iterrows():
    fig.add_trace(go.Scatterpolar(
        r=row[radar_metrics].values,
        theta=radar_metrics,
        fill='toself',
        name=row['Database']
    ))

fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True,
      range=[0, 1]
    )),
  showlegend=True,
  title='Normalized Complexity Profile of Legacy Databases'
)

fig.show()

Schema Complexity Metrics (Legacy Databases)¶

| | Database | Table Count | JDI (Join Dependency Index) | LIF (Logical Interop. Factor) | NF (Normalization Factor) |
| --- | --- | --- | --- | --- | --- |
| 0 | TMP_DF10 | 9.00 | 0.28 | 7.00 | 0.25 |
| 1 | TMP_DF8 | 27.00 | 0.07 | 2.00 | 0.21 |
| 2 | TMP_DF9 | 62.00 | 0.07 | 9.00 | 0.35 |
| 3 | TMP_REAN_DF2 | 13.00 | 0.15 | 1.00 | 0.19 |

5. Query Performance Deep Dive¶

This is the most critical comparison. It directly measures the analytical query performance of the legacy normalized schemas against the denormalized wide-format benchmark schemas using the pre-calculated metrics from the 04_run_comparison.py script.
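The two derived columns in the detailed report, schema_efficiency_factor and performance_improvement_factor, appear to follow the pattern below. This is an inference from the reported values (e.g., TMP_DF10's 5.46x factor and 81.7% improvement on Query 2.1), not a quote of 04_run_comparison.py; the latency figures here are reconstructed for the example.

```python
import pandas as pd

# Sketch of how the derived columns *appear* to be computed, inferred
# from the reported values rather than copied from 04_run_comparison.py:
#   schema_efficiency_factor       = latency / fastest benchmark latency per query
#   performance_improvement_factor = % of latency the benchmark saves vs. each DB

runs = pd.DataFrame({
    'database':     ['bench_wide', 'TMP_DF10'],
    'query_id':     ['Query 2.1', 'Query 2.1'],
    'latency_ms':   [12.03, 65.68],   # reconstructed example latencies
    'is_benchmark': [True, False],
})

# Per-query baseline: the fastest benchmark run
baseline = (runs[runs['is_benchmark']]
            .groupby('query_id')['latency_ms'].min()
            .rename('baseline_latency_ms')
            .reset_index())
runs = runs.merge(baseline, on='query_id')

runs['schema_efficiency_factor'] = runs['latency_ms'] / runs['baseline_latency_ms']
runs['performance_improvement_factor'] = (
    (1 - runs['baseline_latency_ms'] / runs['latency_ms']) * 100)
```

With these definitions the benchmark rows land at exactly 1.0 and 0.0, and the reconstructed TMP_DF10 join query reproduces the 5.46x / 81.7% figures seen in the tables below.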

In [5]:
display_header("Schema Efficiency Factor by Query Category")

if not perf_summary_df.empty:
    # First, display the tabular data (NOT logarithmic)
    print("📊 Schema Efficiency Factor Data Table (Raw Values):")
    print("Note: Values > 1.0 indicate the database is slower than the benchmark baseline")
    
    # Create a pivot table for better readability
    if 'schema_efficiency_factor' in perf_summary_df.columns:
        efficiency_table = perf_summary_df.pivot_table(
            index='database', 
            columns='category', 
            values='schema_efficiency_factor',
            aggfunc='mean'
        ).round(2)
        
        # Add a summary column showing average across categories
        efficiency_table['Average'] = efficiency_table.mean(axis=1).round(2)
        
        # Style the table to highlight high values
        styled_efficiency = efficiency_table.style.background_gradient(cmap='Reds', axis=None)\
            .set_caption("Schema Efficiency Factors (Lower is Better - Benchmark = 1.0)")\
            .format('{:.2f}')
        
        display(styled_efficiency)
        
        # Also show summary statistics
        print(f"\n📈 Summary Statistics:")
        print(f"Worst performing database (highest average): {efficiency_table['Average'].idxmax()} ({efficiency_table['Average'].max():.2f}x slower)")
        print(f"Best performing database (lowest average): {efficiency_table['Average'].idxmin()} ({efficiency_table['Average'].min():.2f}x)")
    else:
        print("⚠️ 'schema_efficiency_factor' column not found in performance data")
        display(perf_summary_df)
    
    print("\n" + "="*50)
    print("📊 Schema Efficiency Factor Chart (Log Scale):")
    
    # Then display the chart with log scale
    fig = px.bar(perf_summary_df,
                 x='database', 
                 y='schema_efficiency_factor', 
                 color='category', 
                 barmode='group',
                 title='Schema Efficiency Factor (Lower is Better)',
                 labels={'schema_efficiency_factor': 'Efficiency Factor (Log Scale)', 'database': 'Database'},
                 category_orders={'category': ['Baseline Performance', 'Join_performance Performance', 'Complex_filtering Performance']})
    
    fig.update_yaxes(type="log") # Use a log scale as differences can be huge
    fig.add_hline(y=1.0, line_dash="dot", annotation_text="Benchmark Baseline", annotation_position="bottom right")
    fig.update_layout(height=600)
    fig.show()
else:
    print("No performance data to plot.")

Schema Efficiency Factor by Query Category¶

📊 Schema Efficiency Factor Data Table (Raw Values):
Note: Values > 1.0 indicate the database is slower than the benchmark baseline
Schema Efficiency Factors (Lower is Better - Benchmark = 1.0)
| database | Baseline Performance | Complex_filtering Performance | Join_performance Performance | Average |
| --- | --- | --- | --- | --- |
| TMP_DF10 | 1.04 | 4.58 | 5.46 | 3.69 |
| TMP_DF8 | 0.69 | 4.38 | 0.58 | 1.88 |
| TMP_DF9 | 1.05 | 1.34 | 1.56 | 1.32 |
| TMP_REAN_DF2 | 0.62 | 1.11 | 0.86 | 0.86 |
| tmp_benchmark_wide_numeric | 1.00 | 1.29 | 1.00 | 1.10 |
| tmp_benchmark_wide_text_nulls | 1.37 | 1.00 | 1.34 | 1.24 |
📈 Summary Statistics:
Worst performing database (highest average): TMP_DF10 (3.69x slower)
Best performing database (lowest average): TMP_REAN_DF2 (0.86x)

==================================================
📊 Schema Efficiency Factor Chart (Log Scale):
In [6]:
display_header("Performance Improvement vs. Best Benchmark")

if not perf_summary_df.empty:
    # First, display the tabular data
    print("📊 Performance Improvement Data Table:")
    print("Note: Positive values mean the benchmark is faster than the legacy database; negative values mean the legacy database is faster")
    
    # Filter out the baseline databases themselves for cleaner analysis
    improvement_df = perf_summary_df[~perf_summary_df['database'].str.contains('benchmark')].copy()
    
    if not improvement_df.empty and 'performance_improvement_factor' in improvement_df.columns:
        # Create a summary table showing improvement factors
        improvement_summary = improvement_df.groupby(['database', 'category'])['performance_improvement_factor'].agg(['mean', 'min', 'max']).round(1)
        improvement_summary.columns = ['Avg_Improvement_%', 'Min_Improvement_%', 'Max_Improvement_%']
        
        # Reset index to make it more readable
        improvement_summary = improvement_summary.reset_index()
        improvement_pivot = improvement_summary.pivot(index='database', columns='category', values='Avg_Improvement_%').round(1)
        
        # Style the table
        styled_improvement = improvement_pivot.style.background_gradient(cmap='Greens', axis=None)\
            .set_caption("Average Performance Improvement (% faster than benchmark)")\
            .format('{:.1f}%')
        
        display(styled_improvement)
        
        # Show detailed breakdown
        print(f"\n📋 Detailed Performance Improvement Breakdown:")
        detailed_table = improvement_df[['database', 'category', 'query_id', 'performance_improvement_factor']].copy()
        detailed_table['performance_improvement_factor'] = detailed_table['performance_improvement_factor'].round(1)
        detailed_table = detailed_table.rename(columns={
            'performance_improvement_factor': 'Improvement_%'
        })
        
        # Sort by improvement factor for better readability
        detailed_table = detailed_table.sort_values(['database', 'category', 'Improvement_%'], ascending=[True, True, False])
        display(detailed_table)
        
        # Summary statistics
        print(f"\n📈 Key Insights:")
        best_db = improvement_pivot.mean(axis=1).idxmin()
        worst_db = improvement_pivot.mean(axis=1).idxmax()
        print(f"Closest to benchmark performance (lowest avg improvement): {best_db} ({improvement_pivot.mean(axis=1)[best_db]:.1f}%)")
        print(f"Furthest from benchmark performance (highest avg improvement): {worst_db} ({improvement_pivot.mean(axis=1)[worst_db]:.1f}%)")
        
    else:
        print("⚠️ 'performance_improvement_factor' column not found or no legacy database data available")
        if not improvement_df.empty:
            display(improvement_df)
    
    print("\n" + "="*50)
    print("📊 Performance Improvement Chart:")
    
    # Then display the chart
    fig = px.bar(improvement_df.sort_values('performance_improvement_factor') if not improvement_df.empty else improvement_df,
                 x='query_id', 
                 y='performance_improvement_factor', 
                 color='database', 
                 facet_row='category',
                 barmode='group',
                 title='Performance Improvement of Benchmark Schemas vs. Legacy Schemas',
                 labels={'performance_improvement_factor': '% Improvement vs. Benchmark', 'query_id': 'Query ID'})
    
    fig.update_layout(height=800)
    fig.show()
else:
    print("Could not generate performance improvement plot.")

Performance Improvement vs. Best Benchmark¶

📊 Performance Improvement Data Table:
Note: Positive values mean the benchmark is faster than the legacy database; negative values mean the legacy database is faster
Average Performance Improvement (% faster than benchmark)
| database | Baseline Performance | Complex_filtering Performance | Join_performance Performance |
| --- | --- | --- | --- |
| TMP_DF10 | 4.2% | 78.2% | 81.7% |
| TMP_DF8 | -44.0% | 77.2% | -72.4% |
| TMP_DF9 | 4.9% | 25.5% | 35.9% |
| TMP_REAN_DF2 | -61.7% | 9.7% | -16.9% |
📋 Detailed Performance Improvement Breakdown:
| | database | category | query_id | Improvement_% |
| --- | --- | --- | --- | --- |
| 6 | TMP_DF10 | Baseline Performance | Query 1.1 | 4.20 |
| 8 | TMP_DF10 | Complex_filtering Performance | Query 3.1 | 78.20 |
| 7 | TMP_DF10 | Join_performance Performance | Query 2.1 | 81.70 |
| 9 | TMP_DF8 | Baseline Performance | Query 1.1 | -44.00 |
| 11 | TMP_DF8 | Complex_filtering Performance | Query 3.1 | 77.20 |
| 10 | TMP_DF8 | Join_performance Performance | Query 2.1 | -72.40 |
| 12 | TMP_DF9 | Baseline Performance | Query 1.1 | 4.90 |
| 14 | TMP_DF9 | Complex_filtering Performance | Query 3.1 | 25.50 |
| 13 | TMP_DF9 | Join_performance Performance | Query 2.1 | 35.90 |
| 15 | TMP_REAN_DF2 | Baseline Performance | Query 1.1 | -61.70 |
| 17 | TMP_REAN_DF2 | Complex_filtering Performance | Query 3.1 | 9.70 |
| 16 | TMP_REAN_DF2 | Join_performance Performance | Query 2.1 | -16.90 |
📈 Key Insights:
Closest to benchmark performance (lowest avg improvement): TMP_REAN_DF2 (-23.0%)
Furthest from benchmark performance (highest avg improvement): TMP_DF10 (54.7%)

==================================================
📊 Performance Improvement Chart:

6. Qualitative Architectural Trade-offs¶

The quantitative data above supports a qualitative assessment of the architectural trade-offs between the legacy design and the proposed wide-format design.

| Feature | Legacy Normalized (e.g., DF9) | Proposed Wide-Format (Benchmark) | Justification Based on Data |
| --- | --- | --- | --- |
| Query Performance | Low | High | The Schema Efficiency Factor table shows legacy schemas running up to ~5.5x slower on join-heavy and complex-filtering queries (TMP_DF10), though TMP_REAN_DF2 remains competitive. |
| Storage Cost | Low | High | comparison_matrix.csv shows the benchmark DBs storing ~5,050 rows in 21-62 MB versus ~486,000 rows in 64 MB for TMP_DF10; data duplication makes the wide format far larger per logical record. |
| Schema Complexity | High (high JDI/NF, many tables) | Very Low (1 table) | The complexity radar plot visually confirms the high complexity scores of the legacy schemas. |
| Data Redundancy | Low (normalized) | High (denormalized) | This is the inherent trade-off of the wide-format design; we trade storage for speed. |
| Ease of Use for BI/GIS | Low (requires complex joins) | High (single table source) | A single flat table is trivial to connect to tools like QGIS, Tableau, or Power BI. |
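The storage-for-speed trade-off above is realized by the flattening step planned for the Phase 2 ETL: many normalized tables are collapsed into one analytical table. A minimal pandas sketch of that idea follows; the table and column names (sites, artifacts, obsidian_blades, etc.) are invented for illustration and do not reflect the actual TMP schemas.

```python
import pandas as pd

# Hypothetical flattening step: collapse a normalized parent/child pair
# into a single wide analytical table. All names here are invented.

sites = pd.DataFrame({'site_id': [1, 2], 'site_name': ['N1W4', 'S2E1']})
artifacts = pd.DataFrame({
    'site_id': [1, 1, 2],
    'artifact_type': ['obsidian_blades', 'figurines', 'obsidian_blades'],
    'count': [14, 3, 7],
})

# One row per site, one column per artifact type: downstream BI/GIS tools
# read a single flat table, at the cost of storing zeros/duplicates.
wide = (artifacts
        .pivot_table(index='site_id', columns='artifact_type',
                     values='count', fill_value=0)
        .reset_index()
        .merge(sites, on='site_id'))
```

Each artifact type becomes its own column (with 0 where a site has none), which is exactly why the wide format avoids joins at query time but duplicates information across rows.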

7. Final Analyst Summary & Recommendation¶

Instructions: Based on the comparative analysis, synthesize the findings and provide a formal recommendation for the Phase 2 unified database architecture. This summary will be a primary input for the final white paper.

Overarching Conclusion:¶

  • Start with a concise, definitive statement. Example: "The comparative analysis demonstrates conclusively that the highly normalized structure of the legacy databases, particularly tmp_df9, is quantitatively inferior for the project's analytical objectives compared to a denormalized, wide-format architecture."

Justification from Evidence:¶

  1. On Performance:
    • Quantify the performance difference. Reference the 'Schema Efficiency Factor' chart directly. Example: "As shown in the efficiency factor plot, the legacy schemas run up to ~5.5x slower than the wide-format benchmarks on join-heavy queries (TMP_DF10: 5.46x). This performance gap makes interactive analysis on the most normalized schemas untenable."
  2. On Complexity:
    • Reference the complexity metrics and the radar plot. Example: "The legacy schemas exhibit high JDI and NF scores, indicative of significant relational complexity that increases the cognitive load for analysts and the technical barrier for connecting to BI and GIS tools. The radar plot clearly visualizes tmp_df9 as the most complex outlier."
  3. On The Cost/Benefit Trade-off:
    • Acknowledge the trade-offs identified in the qualitative table. Example: "While the wide-format approach increases storage costs due to data redundancy, this trade-off is strategically acceptable. The cost of storage is minimal compared to the significant gains in query performance and the drastic reduction in development time and analytical friction for end-users."

Formal Recommendation:¶

  • State the final recommendation clearly and unambiguously.
  • Recommended Architecture: "It is the formal recommendation of this analysis that Phase 2 of the Digital TMP project proceed with the development of a single, denormalized, wide-format primary analytical table. This table should be based on the schema of the tmp_benchmark_wide_text_nulls database, as it provides the best balance of performance and human-readability."
  • Next Steps: "The next step is to finalize the schema of this wide-format table, including data type assignments and column naming conventions, and to proceed with development of the full ETL pipeline in Phase 2 to migrate all legacy data into this new structure."